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Abstract — In this paper, a contrastive evaluation of massive 
parallel implementations of suffix tree and suffix array to acceler- 
ate genome sequence matching are proposed based on Intel Core 
i7 3770K quad-core and NVIDIA GeForce GTX680 GPU(kepler 
architecture). Due to the more regular execution flow of the 
indexed binary search algorithm, the more efficient use of the 
cache memory and the space occupied by the suffix array index 
is much smaller than that of the suffix tree index, the suffix array 
clearly outperform the suffix tree using GPU. The suffix array 
is more than 99 times than that of CPU serial implementation. 
Simultaneously, the space occupation approximately 20%~30% 
relative to that of suffix tree. The experimental results show 
that the parallel matching algorithm with respect to suffix array 
is an efficient approach to high performance bioinformatics 
applications. 

Keywords — suffix array, suffix tree, GPU, binary search, bioin- 
formatics 

I. Introduction 

In recent years, modern multi-core and many-core architec- 
tures are revolutionizing high-performance computing (HPC). 
As more and more processor cores are being incorporated into 
a single chip, the era of the many-core processor has begun. 
The emergence of many-core architectures, such as compute 
unified device architecture(CUDA)-enabled GPUs [12], and 
other accelerator technologies (e.g. field-programmable gate 
arrays (FPGAs) and the Cell/BE), which provide the oppor- 
tunity to significantly reduce the runtime of many biological 
algorithms on commonly available and inexpensive hardware 
with more powerful high-performance computing power. Since 
the introduction of Compute Unified Device Architecture 
(CUD A) in 2007, more than 100 million computers with 
CUDA capable Graphics Processing Units (GPU) have been 
shipped to end users. In the golden age of the GPU computing, 
with such a low barrier of entry, researchers all over the 
world have been engaged in developing new algorithms and 
applications to utilize the extreme floating point execution 
throughput of these GPUs. 

Life science have emerged as a primary application area 
for the use of GPU computing. High-throughput techniques 
for DNA sequencing and gene expression analysis have led 



to a data explosion. Prominent examples are the growth of 
DNA sequence information in NCBI's GenBank database and 
the growth of protein sequences in the UniProtKB/TrEMBL 
database. Furthermore, emerging next-generation sequencing 
technologies ifTBI have broken many experimental barriers to 
genome scale sequencing. Due to GPU performance grows 
faster than CPU performance, therefore, the use of GPUs in 
bioinformatics is a perfect match. 

Suffix tree of a string is the compact trie of all its suffixes, 
it's widely used in bioinformatics applications^, e.g., MUM- 
mer HO) and MUMmerGPU El. It's can be constructed in 
linear time by the length of the string |3 II 18'lfTTl. Nevertheless, 
with the growth of the reference sequence, suffix tree also 
stuck in bottleneck because of Dynamic Random Access 
Memory(DRAM) consumption. Due to the efficient usage of 
the cache memory and the space occupied by suffix array 
only approximately 20% ~ 30% relative to that of suffix 
tree, suffix array are sometimes preferred to suffix tree. That 
is, genome sequence search can be efficiently solved with 
suffix array. Simultaneously in CPU only take 0(n) time 
to implement suffix array |8||13|. In this paper, we use the 
GPU implementations(suffix tree and suffix array) to accelerate 
genome sequence matching on two different platforms: multi- 
core(CPUs) and many-core(GPU). The GPU implementations 
show a tremendous performance, the suffix array is more than 
99 times than that of CPU serial implementation and the suffix 
trees speedup is approximately to 44-fold. 

II. Massively Parallel Processors 

Since 2003, the semiconductor industry has settled on two 
main trajectories for designing microprocessor |7|. The many- 
core trajectory focuses more on the execution throughput of 
parallel applications, in contrast with the multicore trajec- 
tory^. e. maintain the execution speed of sequential programs 
while moving into multiple cores). The many-cores began 
as a large number of much smaller cores, and, once again, 
the number of cores doubles with each generation. A typical 
exemplar is NIVIDIA® GPU, each core is a in-order, heavily 
multi-threaded, single-instruction issue processor that shares 
its control and instruction cache with other cores. Due to the 



differences in the fundamental design philosophies between 
the two types of processor, as illustrated in Figure [T[ which 
cause an enormous performance gap. 




Fig. 1. CPUs and GPUs have fundamentally different design philosophies. 



The original design philosophy of the GPUs is shaped by 
the fast growing video game industry, which exerts tremendous 
economic pressure for the ability to perform a massive number 
of floating-point calculations per video frame in advanced 
games. The hardware takes advantage of a large number of 
execution threads to find work to do when some of them are 
waiting for long-latency memory accesses, thus minimizing 
the control logic required for each execution thread. Small 
cache memories are provided to help control the bandwidth 
requirements of these applications so multiple threads that 
access the same memory data do not need to all go to the 
DRAM. As a result, much more chip area is dedicated to the 
floating-point calculations. 

Because of several limitations, General-purpose program- 
ming using a graphics processing unit(GPGPU) was replaced 
by CUDA fl2l . CUDA programs no longer go through the 
graphics interface at all. Instead, a new general-purpose par- 
allel programming interface on the silicon chip serves the 
requests of CUDA programs. In CUDA programming model, 
all threads in a grid execute the same kernel function, they 
rely on unique coordinates to distinguish themselves from 
each other and to identify the appropriate portion of the 
data to process. All threads are organized into a two-level 
hierarchy using unique coordinates - blockIdx{block index) 
and threadldx(thiead index). gridDim and blockDim provide 
the dimension of the grid and the dimension of each block 
respectively. According to gridDim and blockDim, dynamic 
partitioning of resources can result in subtle interactions be- 
tween resource limitations(shared memory, registers). 

In modern software applications, program sections often 
exhibit a rich amount of data parallelism, a property allowing 
many arithmetic operations to be safely performed on program 
data structures in a simultaneous manner. The CUDA devices 
accelerate the execution of these applications by harvesting 
a large amount of data parallelism. Here, we just presented a 
short, informal discussion of GPU architecture and CUDA fun- 
damentals |15][14|[9|. In recent years, a significant amount of 
new and interesting technologies have sprung up to efficiently 
solve problem in scientific research and commercial applica- 
tions. There are several programming languages suitable for 
GPU programming, except for CUDA, the most common being 
OpenCL B, OpenACC, and C++ AMP 0. 




Fig. 2. The suffix tree of S = acggtacgtac 



III. Suffix Array Construction 



Given a reference sequence S — siS2».s n . For i = 
1,2,. every S(i,n) is a suffix of S. Initially, for the 
four nucleic acid bases include adenine, guanine, thymine, and 
cytosine that makes up Deoxyribonucleic acid(DNA), define 
X = {a, c, g,t}. We shall label the suffixes according to the 
location of the starting character, that is, Si — S(i,n). For 
example, if S = acggtacgtac, = cggtacgtac and Sg = gtac, 
just like the left-hand side of Figure [3] A suffix tree of S of 
length n is a tree with the following properties: 1) Each tree 
edge is labeled by a subsequence of S. 2) Each internal node 
has at least two children. 3) For 1 < i < n, each Si has its 
corresponding labeled path from root to a leaf. 4) No edges 
branching out from the same internal node can start with the 
same character. The typical suffix tree is illustrated as Figure 

El 



The suffix tree can be constructed from the reference 
sequence in 0{n) linear time |fl8l . Nevertheless, all the sig- 
nificant features provided by suffix tree are offered at the cost 
of an important drawback, related to the amount of space that 
is required to store this index structure, which can be nearly 
20-fold with respect to the initial reference size. Afterwards, 
the suffix tree can be transmitted into flatten tree consisting of 
an array of edges and the result of the GPU accelerate genome 
sequence alignment based on flatten tree is efficient, which was 
presented by G. Encarnaijao, N. Sebastiao, and N. Roma 
presented in 2011. 



Still, when compared to suffix trees, the suffix arrays 
are regarded as a more space-efficient implementation. This 
structure can be served as an array of integers representing 
the start position of every lexicographically ordered suffix of 
a string, which is illustrated by Figure [3] For suffix array in 
table HJ defined as SA. 




Fig. 3. For a reference sequence S = acggtacgtac, all the suffixes are added 
into the array and sorted lexicographically so as to binary search. 



TABLE I. 



The sorted suffix index table FOR S = acggtacgtac 
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The most straightforward way to construct the suffix array 
is to simply create an array with all the suffix elements placed 
in ascending order and then apply a sorting algorithm to 
properly sort the suffixes. We can utilize Difference Cover 
Modulo 3(DC3) algorithm to implementation in 0(n) time. 

A linear time suffix array construction algorithm, namely 
DC3. It takes the following 2/3-recursive divide-and-conquer 
approach: 

1. Construct the suffix array of the suffixes starting at position 
i mod 3 0. 

For k = 0, 1, 2, define B^ — {i £ [0, n] \ i mod 3 = k}. 
Let C = Bi 1J B 2 be the set of sample positions and Sc the set 
of sample suffixes. For a reference sequence S = acggtacgtac, 
B 1 = {1, 4, 7, 10}, B 2 = {2, 5, 8} and C = {1, 4, 7, 10, 2, 5, 8}. 
For k = 1,2, construct the strings Ri = [egg] [tac] [gta] 
[cOO], R 2 = [ggt] [acg] [tac] and R = RxQ R 2 = [egg] 
[tac] [gta] [cOO] [ggi] [acg] [tac]. Radix sort[l | the characters 
are of R and rename them with ranks to obtain rank(Si). 



TABLE II. 



FOR S = acggtacgtac, THE RANKS OF THE SORTED 
SAMPLE SUFFIXES 
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2. Construct the suffix array of the remaining suffixes using 
the result of the first step. 

Represent each non-sample suffix Sj € Sb with the 
pair (Si,rank(S i+ i)). For all i,j £ B , Si < Sj 
(Si,rank(Si + i)) < (Sj,rank(Sj + i)). The pairs are then 
radix sorted. For S — acggtacgtac, Bo — {0, 3, 6, 9}. Because 

of (a, 2) < (a, 3) < (c, 5) < (g, 7), S < S 9 < S 6 < S 3 . 

3. Merge the two suffix arrays into one using a standard 
comparison-based merging. To compare suffix Si € Sc with 
Sj £ Sb , There are two cases 

1) i € -Bi,Si < (ti,rank(Si+i)) < {tj,rank{Sj+\)) 

2) i € < Sj (U,t l+ i,rank(S i+2 )) < (tj,tj + x, 
rank(Sj +2 )) 



Consequently, the time complexity of algorithm DC3 is 
0(n). Several parallel and hierarchical memory models of 
computations are presented in [ 8 1 IT71 . 

IV. Indexed Parallel Search 

After the suffixes are sorted, the indexed structure is used 
for searching query sequence P. For a reference sequence S = 
acggtacgtac, from the table [I] and Figure [3] which show that 
the first suffix Si is denoted as ac, S5 = eggtaegtae and S7 
= ggtaegtac. If P = c, we would find them in locations 4 to 
6 in the sorted suffix index table. Similarly, P = a appears at 
three locations, namely 10, 1, 6 in reference sequence S. This 
just goes to illustrate that for every query sequence P which 
appears in S, there is a range which it appears in the sorted 
suffix index table. The range is defined by its left boundary, 
denoted as LB, and its right boundary, denoted as RB. For P 
= a, LB = 1 and RB = 3. For P = c, LB = 4 and RB = 6. 
For P = ggtac, LB = RB = 7. Therefore, our job is to search 
for LB and RB. 

Given a query sequence P, in usual, binary search al- 
gorithm is an ideal strategy. During the binary search, we 
compare P with a suffix Si in the suffix array. There are two 
possibilities: 1) P is a prefix of Sj and 2) P is not a prefix of 
Si. 

1. If P is a prefix of Si, LB may be i or to the left of i and 
RB may only be i or the right of i. Thus search both left and 
right of Si is necessary. 

2 If P is not a prefix of S, and P < Sj, both LB and RB 
must be to the left of Sj if P occurs in S. 

3 If P is not a prefix of S 4 and P > Sj, both LB and RB 
must be to the right of Sj if P occurs in S. 

If P is a prefix of Sj and P is not a prefix of Sj_i, then 
LB — i. Similarly, RB = i if P is a prefix of Sj and P is 
not a prefix of Sj+i. If S = acggtacgtac, for P = c, LB = 4 
and RB = 6. Assume that P = tac, LB = 10 and RB = 11. 
Both LB and RB can be obtained by binary search. 

On the basis of CUDA, the algorithm is modified 
for multiple query sequences matching concurrently, the 
pseudocode is depicted in Alg{T] In the kernel function 
cudaGene Bin Search, thd is a private register with respect 
to a thread respectively, which denote as thread index. All 
query sequences are organized to a string array QuerySeq so 
that the exclusive thread can be mapped into QuerySeq[i/«i]. 
The reference sequence is constructed as suffix array so as 
to the indexed structure enable all threads to string matching 
simultaneously by means of binary search algorithm supported 
by GPU. The boundary of the reference sequence can be 
determined by variables, that is, left and right. Assume that 
the length of the reference sequence is n and the length of 
the query sequence is m, the algorithm take 0(m log n) time 
to determine L£?(the left boundary of the query sequence) or 
i?£?(the right boundary of query sequence). (LB,RB) is the 
matching results with respect to QuerySeq[ift,<i]. If (LB,RB) 
is NU LL, there is no suffix similar to query sequence. From 
alg{T[ when pivot is determined by (L + R) 1, we 
extract a part of sequence segments from global memory 
to shared memory until middleSuffixLength < or query- 
SequenceLength < 0, i.e., S[SA[pivot]]— > midSuffix and 



Benchmark test suffix tree and suffix array binary search algorithms in multi-core CPUs 



Benchmark test suffix tree and suffix array binary search algorithms in GPU 
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Fig. 4. Benchmark test of suffix tree and suffix array indexed based search algorithms in multi-core CPUs and GPU. 
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Fig. 5. kernel execution and Communication time for the algorithms in the Fig. 6. The speedup ratio of the algorithms between CPU and GPU. 
NIVIDIA GeForce GTX 680 GPU. 



QuerySeq[t/iG?] — ¥ querySequence. Global memory is large but 
slow, whereas the shared memory is small but fast. A common 
strategy is to partition the data into subsets called tiles such that 
each tile fits into the shared memory. It is an effective strategy 
for achieving high performance in virtually all types of parallel 

computing systems. In 43rd pseudocode line, syncthreads() 

function be viewed as a command to be waiting all threads 
execution in a block, namely synchronization. Only when 
all threads execute over here, the result of the boundaries 
(LB,RB) with respect to QuerySeq[£/j<i] can be assigned and 
transfered to host. 

V. Results 

The indexed based search algorithms were evaluated in 
a computer composed of an Intel Core i7 3770K quad-core 
processor, running at 3.5GHz, with 1600MHz and 16GB of 
DRAM. It's worth mentioning that the graphics card is NI- 
VIDIA GeForce GTX 680, which is Kepler GPU architecture, 
with 1536 stream processors running at 6008MHz and 2GB of 
RAM. 

We extracted DNA sequence from NCBI Nucleotide to 
evaluate the performance of the previously described algo- 
rithms. The reference sequence, which was used to build 



the indexed structure(suffix tree and suffix array), corre- 
sponds to the first 10 7 nucleotides of the NT 167 186.1 
Homo sapiens chromosome 1 genomic contig. Multiple query 
sequences(considered set) are 1024 nucleotides long, which 
derive from a mix of the DNA sequences extracted from the 
NT_167 186.1 Homo sapiens chromosome 1 genomic contig 
and the NT_039173.8 Mus musculus strain C57BL6J chro- 
mosome 1 genomic contig. Several sets of query sequences 
were used in the experimental test, each one consists of a 
different number of elements, ranging from 512 to 2097152 
query sequences. 

In the preliminary estimation, for evaluating the best per- 
formance, the parallel implementations of genome sequence 
search algorithms based either on the suffix tree or on the suffix 
array index. In a homogeneous multi-core CPU, the algorithms 
were executed using OpenMP to support the parallel execution 
of 2, 4 and 8 concurrent threads. Simultaneously it also 
provided a comparative evaluation with highly efficient CPU 
based software named MUMmerlTOll. 

The results is illustrated in the left side of Figure [4] 
Although suffix tree take 0(logm) time to search and suffix 
array take 0(mlogn) time to search, due to a more efficient 
usage of the cache memory by the suffix array, the asymptotic 
runtime correspond to the suffix array is slightly greater than 



Algorithm 1 Kernel function cudaGeneBinSearch: GPU 
threads parallel matching multiple genome sequence using 
binary search algorithm 

1 //geneSubsequence[thd] is one of genome subsequences 

2 thd = threadIdx.x+blockDim.x*blockIdx.x; 

3 //assign the boundaries to registers 

4 L = left, R = right 

5 shared midSuffix[segLength], querySequence[segLength] 

6 //an 0(m log n) search algorithm to determine LB 

7 while(R > L + 1) 

8 { 

9 pivot = (L + R) > 1 

10 do { 

11 adjust middleSuffxLength and querySequenceLength 

12 //update and extract trailing pair bases 

13 //send extracted segment to shared memory 

14 midSuffix = extract segment from S[SA[pivot]] 

15 querySequence = extract segment from QuerySeq[thd] 

16 if (midSuffix ! = querySequence) break 

17 } while (middleSuffxLength > && querySequenceLength 
>0) 

18 if (midSuffix < querySequence) 

19 then R = pivot 

20 else 

21 then L = pivot 

22 } 

23 LB = R 

24 L = left, R = right 

25 I I an O(mlogn) search algorithm to determine RB which 
similar to LB 

26 while(R > L + 1) 

27 { 

28 pivot = (L + R) > 1 

29 do { 

30 adjust middleSuffxLength and querySequenceLength 

31 I /update and extract trailing pair bases 

32 //send extracted segment to shared memory 

33 midSuffix = extract segment from S[SA[pivot]] 

34 querySequence = extract segment from QuerySeq[thd] 

35 if (midSuffix ! = querySequence) break 

36 } while (middleSuffxLength > && querySequenceLength 
>0) 

37 if (midSuffix < querySequence) 

38 then R = pivot 

39 else 

40 then L = pivot 

41 } 

42 RB = L 

43 syncthreads() 

44 res [thd « 1] = LB 

45 res[thd < 1 + 1] = RB 



that of the suffix tree. It also observed that the parallel 
implemented suffix tree and suffix array are significantly faster 
than MUMmer. 

Furthermore, the performance of the parallel algorithms 
was evaluated by NIVIDIA GeForce GTX 680. The obtained 
results are depicted in the right side of Figure [4] These results 
correspond to the total runtime of the search algorithms, which 
considered communication time(transfer data between host and 
device) and kernel execution time. The performance of the sep- 
arated components were portrayed in Figure [5] Just like MUM- 
mer based on CPU, there is a genome sequence alignment 
tool based on GPU architecture, namely MUMmerGPU[ 15]. 



It's worth noting that this figure not contain a comparison 
with CUDASW+, because of maximum reference size of about 
64* 10 3 base pairs is the hugely limit factor. Since the indexed 
structure must be transferred from the CPU host memory to 
the GPU device memory, the input time is critical to all index- 
based search algorithms. Indeed, when the number of query 
sequences to be searched is very small, the data input time 
occupy the vast majority of total time. Nevertheless, for a 
larger number of query sequences, the throughput of the highly 
parallel implementations by GPU amortized the input time. 
The GPU implementations show a tremendous performance, 
when the number of query sequences is about 2097152, the 
suffix array is more than 99 times than that of CPU serial 
implementation and the suffix tree's speedup is approximately 
to 44-fold. The more details are observed in Figure [6] 

Unlike what happened in CPU implementations, due to the 
achieved performance in accordance with memory accesses, 
the experimental results show that GPU implementations 
clearly favor suffix array. The reason for this situation is not 
only the more regular execution flow of this algorithm and 
its more efficient use of the cache memory, but is also the 
space occupied by the suffix array index is much smaller than 
that of the suffix tree index, which makes the suffix array 
implementation to always present a much lower transfer time 
from the host to the GPU device. From the obtained results 
it can be observed that the runtime of MUMmerGPU were 
consistently higher than the implemented suffix tree and suffix 
array, that is, the indexed search algorithm with respect to 
suffix array using GPU, even suffix tree, is efficient method to 
high performance bioinformatics applications. 

VI. Conclusion 

This paper proposed a comparative evaluation of suffix tree 
and suffix array, which are extra applicable for accelerating 
DNA sequence matching. These indexed structures were thor- 
oughly compared using two different parallel platforms: multi- 
core(i7 3770K quad-core) and many-core(NVIDIA GeForce 
GTX680 GPU). 

These observations reveal that suffix array is slightly 
greater than suffix tree though the asymptotic search time of 
the suffix array (0(nlogm)) is far higher than that of the suffix 
tree (0(n)) in multi-core platform, due to its more efficient 
use of the cache memory. Nevertheless, because of the more 
regular execution flow of this algorithm and improved usage 
of cache, the obtained results show that suffix array clearly 
outperform suffix tree in many-core platform. Suffix array 
occupy the space is approximately 20%^30% relative to that 
of suffix tree. 
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